Frequentist Evaluation of Intervals Estimated
for a Binomial Parameter and for the Ratio of
Poisson Means

Robert D. Cousins, Kathryn E. Hymes, Jordan Tucker 

Dept. of Physics and Astronomy, University of California, Los Angeles, California
90095, USA



Abstract 

Confidence intervals for a binomial parameter or for the ratio of Poisson means are 
commonly desired in high energy physics (HEP) applications such as measuring 
a detection efficiency or branching ratio. Due to the discreteness of the data, in 
both of these problems the frequentist coverage probability unfortunately depends 
on the unknown parameter. Trade-offs among desiderata have led to numerous sets 
of intervals in the statistics literature, while in HEP one typically encounters only 
the classic intervals of Clopper-Pearson (central intervals with no undercoverage 
but substantial over-coverage) or a few approximate methods which perform rather 
poorly. If strict coverage is relaxed, some sort of averaging is needed to compare 
intervals. In most of the statistics literature, this averaging is over different values 
of the unknown parameter, which is conceptually problematic from the frequentist 
point of view in which the unknown parameter is typically fixed. In contrast, we 
perform an (unconditional) average over observed data in the ratio-of-Poisson-means 
problem. If strict conditional coverage is desired, we recommend Clopper-Pearson 
intervals and intervals from inverting the likelihood ratio test (for central and non- 
central intervals, respectively). Lancaster's mid-P modification to either provides 
excellent unconditional average coverage in the ratio-of-Poisson-means problem.

1 Introduction



The construction of confidence intervals for a binomial parameter (probability 
of success in a binomial distribution), while already performed by Clopper 
and Pearson (C-P) in 1934 [1], remains a topic of discussion in the modern
statistics literature due to differences in opinion about the best way to deal
with imperfect coverage rooted in the discreteness of the observed number
of successes. Clopper and Pearson's central intervals, while guaranteeing no
undercoverage, result in considerable overcoverage (conservatism). Numerous
alternatives have been put forward in the intervening years, with reviews such
as that by Brown, Cai, and DasGupta [2] recommending for general use some
sets of intervals which are less conservative than those of C-P, but which 
undercover for certain values of the binomial parameter. In this paper, we 
examine the problem from the point of view of high energy physics (HEP) 
applications, including the problem of confidence intervals for the ratio of 
Poisson means. The latter problem provides an additional frequentist criterion, 
not yet considered by Brown et al., for judging the merits of sets of intervals 
for a binomial parameter. 

Figure 1a illustrates the issue to be addressed. For each value of the binomial
parameter p, one supposes that it is the true but unknown value, and calculates
the long-run fraction of experiments for which that value is contained
in ("covered by") the reported confidence intervals. In Fig. 1a, the number of
trials is fixed at 10, the probabilities for the number of successes are calculated
from the binomial formula using the true value of p, and the central C-P
confidence intervals with a confidence level (C.L.) of 68.27% are used. The
upper and lower endpoints of the C-P interval are, respectively, 84.13% C.L.
upper and lower one-sided confidence limits. The coverage of the one-sided
confidence limits is always greater than or equal to 84.13%, with equality on
a discrete finite set of values. As this set of points is different for upper and 
lower confidence limits, the coverage of the two-sided intervals in this exam- 
ple is always strictly greater than 68.27%, an unfortunate consequence of the 
discrete nature of the observation. 

For comparison, Fig. 1b is the coverage plot for central intervals derived using
a Bayesian technique with Jeffreys prior, as described below. The coverage 
oscillates around the nominal 68.27%, in a way that by eye seems to have an 
"average" value near 68.27%. The problem from the frequentist point of view 
is that such averaging over values of the unknown parameter is typically not 
appropriate since the unknown true value of p is fixed, i.e., not sampled from 
a distribution. 

The effect of discreteness is also displayed in Figs. 2a and b, which show the
coverage as a function of ntot, for fixed p = 0.1. The above four plots are
horizontal and vertical slices of a much larger pattern of behavior displayed in
Figs. 3a and b. In these two figures, and corresponding figures below, ΔCL is
the difference between the actual coverage and the nominal coverage, in this
case 68.27%.

Fig. 1. (a) Coverage of 68.27% C.L. Clopper-Pearson intervals, and (b) coverage
of intervals calculated using a Bayesian method with Jeffreys prior and containing
68.27% posterior probability, both as a function of p, for fixed ntot = 10. (a) and
(b) are horizontal slices of Figs. 3a and b, respectively.

Fig. 2. (a) Coverage of 68.27% C.L. Clopper-Pearson intervals, and (b) coverage
of intervals calculated using a Bayesian method with Jeffreys prior and containing
68.27% posterior probability, as a function of ntot, for fixed p = 0.1. (a) and (b) are
vertical slices of Figs. 3a and b, respectively.



These are but two of many sets of intervals that have been proposed. The 
saw-tooth features of the coverage plots are intrinsic to all methods except the 
randomization technique (mentioned in Sec. 3.2.1), which brings other difficulties.
Which sets are deemed preferable depends on the value one attaches to never
having undercoverage, on what sort of averaging (if any) over values of p one
allows, whether or not one desires central intervals, and additional issues such
as whether one is especially concerned about behavior near the endpoints,
p = 0 and 1.

In this paper, we emphasize that a frequentist averaging method, which averages
over repeatedly sampled data, can be used to evaluate sets of intervals,
in contrast to most previous averaging methods which average over the parameter
p in some metric. The frequentist average is performed by using the
strong connection between confidence intervals for a binomial parameter and
confidence intervals for the ratio of two unknown Poisson means. For pairs of
integers sampled from two fixed but unknown Poisson means, fluctuations in
the total number of observed events provide a random sampling which partially
smooths out the saw-tooth structure seen in binomial coverage plots.
Said another way (using terminology defined below), we use the unconditional
global coverage as a criterion for averaging over the imperfect conditional coverage
at each fixed total number of events.

Fig. 3. (a) Coverage of 68.27% C.L. Clopper-Pearson intervals, and (b) coverage
of intervals calculated using a Bayesian method with Jeffreys prior and containing
68.27% posterior probability, as a function of p and ntot. ΔCL is the difference
between the actual coverage and the nominal coverage, 68.27%.

In the traditional definition of "confidence interval", defined by Neyman as we
discuss below, the name implies no undercoverage for any value of the unknown 
parameter. When dealing with approximate methods, immaterial departures 
from perfect coverage are typically tolerated as long as it is clearly understood 
that coverage is only approximate. When methods yield intervals which are 
known to have non-negligible undercoverage for some values of the unknown 
parameter (such as for the mid-P intervals for the binomial parameter), the 
statistics literature is mixed on whether or not to refer to these intervals as 
confidence intervals. In this paper, we attempt to follow HEP practice by requiring
no undercoverage when referring to intervals as "confidence intervals".

In Sec. 2 we review the relevant concepts from interval and hypothesis test
construction and define the notation. In Sec. 3 we briefly describe a number of
papers from the vast literature on binomial intervals. In Sec. 4 we generalize
to the ratio-of-Poisson-means problem, and review some relevant literature.
In Sec. 5 we present our results on the coverage of a number of the methods.
We conclude in Sec. 6.

2 Definitions and Notation



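We let Bi(non | ntot, p) denote the probability of non successes in ntot trials, each
with binomial parameter p:

$$ \mathrm{Bi}(n_{\rm on} \mid n_{\rm tot}, p) = \frac{n_{\rm tot}!}{n_{\rm on}!\,(n_{\rm tot}-n_{\rm on})!}\; p^{\,n_{\rm on}}\,(1-p)^{(n_{\rm tot}-n_{\rm on})}. \tag{1} $$

In repeated trials, non has mean

$$ n_{\rm tot}\,p \tag{2} $$

and rms deviation

$$ \sqrt{n_{\rm tot}\,p\,(1-p)}. \tag{3} $$

For asymptotically large ntot, Bi can be approximated by a normal distribution
with this mean and rms deviation.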

With observed number of successes non, the likelihood function ℒ(p) follows
from reading Eqn. 1 as a function of p. The maximum is at

$$ \hat{p} = n_{\rm on}/n_{\rm tot}. \tag{4} $$



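In some applications, ntot is not fixed but is itself a random variable sampled
from a Poisson distribution with mean μtot:

$$ \mathrm{Poi}(n_{\rm tot} \mid \mu_{\rm tot}) = \frac{\mu_{\rm tot}^{\,n_{\rm tot}}\, e^{-\mu_{\rm tot}}}{n_{\rm tot}!}. \tag{5} $$

In this case, non and noff = ntot − non can be considered to be independent
random variables, each sampled from a Poisson distribution with means μon
and μoff, respectively, satisfying

$$ \mu_{\rm on} + \mu_{\rm off} = \mu_{\rm tot}. \tag{6} $$

The ratio of the Poisson means is then

$$ \lambda = \mu_{\rm off}/\mu_{\rm on}, \tag{7} $$

and the binomial parameter can be written as

$$ p = \mu_{\rm on}/\mu_{\rm tot} = 1/(1+\lambda). \tag{8} $$

The joint probability P(non, noff) of observing non and noff can then be expressed
in two equivalent ways: as the product of independent Poisson probabilities
for non and noff; or as the product of a single Poisson probability with
mean μtot for the total number of events ntot, and the binomial probability
that this total is divided as observed:

$$ P(n_{\rm on}, n_{\rm off}) = \frac{e^{-\mu_{\rm on}}\,\mu_{\rm on}^{\,n_{\rm on}}}{n_{\rm on}!}\; \frac{e^{-\mu_{\rm off}}\,\mu_{\rm off}^{\,n_{\rm off}}}{n_{\rm off}!} \tag{9} $$

$$ = \frac{e^{-\mu_{\rm tot}}\,\mu_{\rm tot}^{\,n_{\rm tot}}}{n_{\rm tot}!} \times \frac{n_{\rm tot}!}{n_{\rm on}!\,(n_{\rm tot}-n_{\rm on})!}\; p^{\,n_{\rm on}}\,(1-p)^{(n_{\rm tot}-n_{\rm on})}. \tag{10} $$

In more compact notation, we have:

$$ P(n_{\rm on}, n_{\rm off}) = \mathrm{Poi}(n_{\rm on} \mid \mu_{\rm on})\, \mathrm{Poi}(n_{\rm off} \mid \mu_{\rm off}) \tag{11} $$

$$ = \mathrm{Poi}(n_{\rm tot} \mid \mu_{\rm tot})\, \mathrm{Bi}(n_{\rm on} \mid n_{\rm tot}, p). \tag{12} $$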

This observation is the basis of hypothesis tests on the ratio of Poisson means 
going back to Przyborowski and Wilenski [3] in 1940, as recommended in 
HEP by James and Roos [4], and as discussed by statistician Reid [5]. All
the dependence on the ratio of Poisson means λ is in the conditional binomial
probability for the observed "successes" non, given the observed total number
of events ntot = non + noff.
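
The equivalence of Eqns. 11 and 12 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the example values of the means are assumptions introduced here for illustration and do not come from any particular package.

```python
import math

def poisson_pmf(n, mu):
    """Poisson probability of observing n events for mean mu (Eqn. 5)."""
    return math.exp(-mu) * mu**n / math.factorial(n)

def binomial_pmf(n_on, n_tot, p):
    """Binomial probability of n_on successes in n_tot trials (Eqn. 1)."""
    return math.comb(n_tot, n_on) * p**n_on * (1.0 - p)**(n_tot - n_on)

def joint_two_ways(n_on, n_off, mu_on, mu_off):
    """Return P(n_on, n_off) computed via Eqn. 11 and via Eqn. 12."""
    mu_tot = mu_on + mu_off            # Eqn. 6
    p = mu_on / mu_tot                 # Eqn. 8
    n_tot = n_on + n_off
    product_of_poissons = poisson_pmf(n_on, mu_on) * poisson_pmf(n_off, mu_off)
    poisson_times_binomial = poisson_pmf(n_tot, mu_tot) * binomial_pmf(n_on, n_tot, p)
    return product_of_poissons, poisson_times_binomial

# Hypothetical example values: the two numbers agree up to rounding.
via_eqn_11, via_eqn_12 = joint_two_ways(3, 7, mu_on=2.5, mu_off=6.0)
print(via_eqn_11, via_eqn_12)
```

Only the binomial factor depends on p (equivalently on λ), which is why inference on the ratio can be conditioned on ntot.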

We consider a general parameter θ (such as p or λ) and randomly sampled
data (such as non or other observables), the probability of which depends on
θ. We then consider a recipe for computing the endpoints of a confidence
interval [tlow, tup] for θ, as functions of the (randomly sampled) data. (In this
paper we always include the endpoints in the confidence interval.) The set
of all confidence intervals obtainable from all possible data sets using this
recipe is called a confidence set. For each value of θ, one can then compute the
probability that that θ is contained in ("covered by") the confidence intervals
in the confidence set, for data sampled according to that θ. Normally it is
highly desirable that this coverage probability be independent of θ, in which
case it is called the confidence coefficient or (more commonly in HEP) the
confidence level (C.L.) of the confidence set. For situations such as those in this
paper, in which the data take on only discrete values, the coverage probability
depends on θ, as illustrated above in Figs. 1a and b.

In classical hypothesis testing, a common hypothesis test is that which tests
the hypothesis (the null hypothesis H0) that θ is equal to a particular value, θ0,
against the alternative (H1) that θ ≠ θ0. One constructs recipes for
accepting/rejecting H0 based on the (randomly sampled) data, the probability for
which depends on θ. One defines the significance level α of the test (also called
size of the test) as the probability of rejecting H0 if H0 is true; again it is
desirable that α is independent of θ. In the formal theory of Neyman-Pearson
hypothesis testing, α is specified in advance; once data are obtained, the p-value
is the smallest value of α for which H0 would be rejected.

As discussed by Kendall and Stuart and successors [6], one can construct a
hypothesis test at significance level α simply by using a confidence set with
C.L. = 1 − α and accepting H0 if θ0 is contained in the confidence interval
for θ based on the obtained data. One can equally well derive confidence sets
from any given recipe for testing the hypothesis θ = θ0 simply by including
in the interval those values of θ0 which would not be rejected by such a test.
This way of constructing confidence intervals is referred to in the statistics
literature as "inverting the hypothesis test". (An example now familiar in the
HEP literature is the set of intervals advocated by Feldman and Cousins [7],
which are constructed by inverting the likelihood ratio test of Ref. [6].) It can
happen that the resulting "intervals" are not simply connected, in which case
various adjustments are typically made, for example adding to the interval
any interior points not initially part of it (thus adding to the over-coverage).

In this duality, confidence intervals formed by inverting a test with significance
level α have coverage probability 1 − α under H0, i.e.,

$$ P(\theta_0 \in [t_{\rm low}, t_{\rm up}]) = 1 - \alpha. \tag{13} $$

Central confidence intervals have the additional property that the intervals
[tlow, tmax] and [tmin, tup] each separately have coverage probability 1 − α/2,
i.e.,

$$ P(\theta_0 \in [t_{\rm low}, t_{\rm max}]) = P(\theta_0 \in [t_{\rm min}, t_{\rm up}]) = 1 - \alpha/2, \tag{14} $$

where tmax and tmin are the maximum and minimum values of θ defined in the
model (e.g., tmin = 0 and tmax = 1 if θ is a binomial parameter). In this case,
for example, tup is often referred to as a (1 − α/2) C.L. upper confidence limit
for θ.

If, due to the discreteness, the significance level can only be specified to be
less than or equal to α, then the equal signs in Eqns. 13 and 14 become "≥".

Without invoking a Gaussian approximation in the construction of an interval
itself, it is often useful to make the correspondence with the number of
Gaussian standard deviations having a single-tailed probability equal to α/2.
Thus, Z denotes the number of standard deviations away from the center of
a Gaussian distribution, with a subscript representing the (one-tailed) tail
probability beyond that Z:

$$ Z_{\alpha/2} = \Phi^{-1}(1-\alpha/2) = -\Phi^{-1}(\alpha/2), \tag{15} $$

where

$$ \Phi(Z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{Z} \exp(-t^2/2)\, dt = \frac{1 + \mathrm{erf}(Z/\sqrt{2})}{2}, \tag{16} $$

so that

$$ Z_{\alpha/2} = \sqrt{2}\; \mathrm{erf}^{-1}(1-\alpha). \tag{17} $$

E.g., Zα/2 = 1 for α/2 = 0.159, and Zα/2 = 1.64 for α/2 = 0.05.
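
The correspondence between tail probability and Z in Eqns. 15 and 16 can be evaluated with the Python standard library. This is a minimal sketch; the function names are assumptions chosen here for illustration.

```python
import math
from statistics import NormalDist

def z_from_tail(tail_prob):
    """Z whose one-tailed Gaussian probability is tail_prob (Eqn. 15)."""
    return NormalDist().inv_cdf(1.0 - tail_prob)

def tail_from_z(z):
    """One-tailed Gaussian probability beyond z."""
    return 1.0 - NormalDist().cdf(z)

def phi_via_erf(z):
    """Cumulative Gaussian written with the error function (Eqn. 16)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(z_from_tail(0.05), 2))   # 1.64
print(round(tail_from_z(1.0), 4))    # 0.1587
```

A two-sided 68.27% C.L. thus corresponds to α/2 = 0.1587 in each tail, i.e., Zα/2 = 1.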



3 Recipes for intervals for p 

A plethora of recipes exists for intervals approximating confidence intervals 
for binomial parameter p. They correspond to various choices regarding: 

• Whether or not the intervals are central intervals; 

• Whether or not the intervals are derived from rigorously inverting a hypoth- 
esis test (in which case, which test?); 

• Whether or not an asymptotic approximation is invoked; 

• Whether or not Bayesian machinery is used to derive the intervals; 

• Whether or not so-called "corrections" are made in an attempt to improve 
the coverage probability. 

As emphasized by Cai [8], some methods with bad properties as one-sided 
intervals have good properties as two-sided intervals due to cancellations in 
coverage departures between the two tails. 

3.1 Asymptotic approximations 

We begin with one of the most popular methods, which is also one of the 
worst-performing if not the worst-performing of popular methods. We follow 
the literature in referring to this interval as the Wald interval. After estimating 

$$ \hat{p} = n_{\rm on}/n_{\rm tot}, \tag{18} $$

the Wald method invokes the Gaussian approximation without properly inverting
the hypothesis test against the null, but rather simply substituting p̂ for p
into Eqn. 3 and using this fixed value of the rms to obtain the two endpoints,

$$ \hat{p} \pm Z_{\alpha/2} \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n_{\rm tot}}}. \tag{19} $$

Already in 1927, Edwin Wilson [9] realized that since the rms depends on
the unknown parameter p, the more appropriate way to invoke the Gaussian
approximation was by consistently inverting the test using the rms of the null
hypothesis for each value of p. For the lower endpoint, one uses the lowest value
p1 such that p1 + Zα/2 √(p1(1 − p1)/ntot) contains p̂. Analogously for the upper
endpoint, one uses the largest value p2 such that p2 − Zα/2 √(p2(1 − p2)/ntot)
contains p̂. Letting T = (Zα/2)²/ntot, this leads to a quadratic equation in p
for the endpoints, (p̂ − p)² = T p(1 − p), with solutions

$$ \frac{\hat{p} + T/2}{1+T} \pm \frac{\sqrt{\hat{p}\,(1-\hat{p})\,T + T^2/4}}{1+T}. \tag{20} $$

These endpoints form the Wilson score interval; in spite of the fact that it
is a non-iterative solution using nothing more than a square root, sadly it is
commonly overlooked in favor of the Wald interval when a quick Gaussian
estimate is desired.
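
To make the comparison concrete, the Wald endpoints (Eqn. 19) and the Wilson score endpoints (the solutions of the quadratic equation above) can be sketched as follows. The function names and the default confidence level are assumptions made here for illustration; no clipping to [0, 1] is applied to the Wald result.

```python
import math
from statistics import NormalDist

def z_value(conf_level):
    """Z_{alpha/2} for a two-sided confidence level 1 - alpha."""
    alpha = 1.0 - conf_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def wald_interval(n_on, n_tot, conf_level=0.6827):
    """Wald interval: rms evaluated at the point estimate p_hat."""
    p_hat = n_on / n_tot
    half = z_value(conf_level) * math.sqrt(p_hat * (1.0 - p_hat) / n_tot)
    return p_hat - half, p_hat + half

def wilson_interval(n_on, n_tot, conf_level=0.6827):
    """Wilson score interval: roots of (p_hat - p)^2 = T p (1 - p)."""
    p_hat = n_on / n_tot
    t = z_value(conf_level) ** 2 / n_tot
    center = (p_hat + t / 2.0) / (1.0 + t)
    half = math.sqrt(p_hat * (1.0 - p_hat) * t + t * t / 4.0) / (1.0 + t)
    return center - half, center + half

# At n_on = 0 the Wald interval collapses to zero length; Wilson does not.
print(wald_interval(0, 10))
print(wilson_interval(0, 10))
```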

Letting p̃ denote the midpoint of the Wilson score interval, from Eqn. 20 one
has

$$ \tilde{p} = \frac{\hat{p} + T/2}{1+T} = \frac{n_{\rm on} + (Z_{\alpha/2})^2/2}{n_{\rm tot} + (Z_{\alpha/2})^2}. \tag{21} $$

As discussed in detail by Agresti and Coull [10], p̃ differs from p̂ by formally
adding (Zα/2)² to the number of actual trials, and making half of them successes.
It thus "shrinks" the maximum-likelihood point estimate p̂ towards
0.5. For 95% C.L. (Zα/2 = 1.96), the easy-to-remember rule of thumb is simply
"add four trials with two successes" to obtain the (approximate) Wilson
midpoint. For quick estimates one can use p̃ rather than p̂ (and ntot + (Zα/2)²
rather than ntot) in the Wald formula (Eqn. 19) and obtain intervals with
surprisingly decent coverage, much better than when using p̂ (and avoiding
the useless result at extreme data). We refer to such intervals as generalized
Agresti-Coull (AC) intervals (adding "generalized" to the name given by
Brown et al. [2] to distinguish from the simpler version). Agresti and Coull
themselves (who regard the C-P intervals as not optimal for statistical practice
due to their conservatism) advocate AC intervals for teaching and the Wilson
score interval for statistical practice [11].

The asymptotic theory in which log likelihood-ratios are related to chi-square
distributions [12,13] provides another interval estimate. In the present case,
the interval consists of all points satisfying

$$ -2 \ln \frac{\mathcal{L}(p)}{\mathcal{L}(\hat{p})} \le (Z_{\alpha/2})^2. \tag{22} $$

As there is more than one use of the likelihood ratio for intervals in this paper,
we refer unambiguously to intervals from Eqn. 22 as Δ(−2 ln ℒ) intervals.
In addition to the usual caution required in using asymptotic formulas for
small numbers of events, in the present case there are well-known issues at the
extrema of p, where the conditions of the asymptotic theory justifying Eqn. 22
are not satisfied.
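
The generalized Agresti-Coull recipe and the Δ(−2 ln ℒ) interval can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the likelihood-ratio endpoints are found by simple bisection, and the endpoint is set to 0 (or 1) when non = 0 (or non = ntot), which is one possible convention at the extrema.

```python
import math
from statistics import NormalDist

def z_value(conf_level):
    """Z_{alpha/2} for a two-sided confidence level 1 - alpha."""
    return NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)

def generalized_agresti_coull(n_on, n_tot, conf_level=0.6827):
    """Wald formula evaluated with p_tilde and n_tot + Z^2 (see Eqn. 21)."""
    z2 = z_value(conf_level) ** 2
    n_tilde = n_tot + z2
    p_tilde = (n_on + z2 / 2.0) / n_tilde
    half = math.sqrt(z2 * p_tilde * (1.0 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half

def delta_m2lnl(p, n_on, n_tot):
    """-2 ln(L(p)/L(p_hat)) for the binomial likelihood, for 0 < p < 1."""
    p_hat = n_on / n_tot
    def term(k, q, q_hat):
        # k * ln(q / q_hat), taken to vanish when k = 0
        return 0.0 if k == 0 else k * math.log(q / q_hat)
    return -2.0 * (term(n_on, p, p_hat) + term(n_tot - n_on, 1.0 - p, 1.0 - p_hat))

def likelihood_ratio_interval(n_on, n_tot, conf_level=0.6827, n_iter=60):
    """Endpoints where delta_m2lnl crosses Z^2, located by bisection."""
    z2 = z_value(conf_level) ** 2
    p_hat = n_on / n_tot
    def bisect(inside, outside):
        for _ in range(n_iter):
            mid = 0.5 * (inside + outside)
            if delta_m2lnl(mid, n_on, n_tot) <= z2:
                inside = mid
            else:
                outside = mid
        return inside
    low = 0.0 if n_on == 0 else bisect(p_hat, 0.0)
    up = 1.0 if n_on == n_tot else bisect(p_hat, 1.0)
    return low, up

print(generalized_agresti_coull(3, 10))
print(likelihood_ratio_interval(3, 10))
```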



As discussed by Cox and Hinkley [14], for the exponential family of distributions,
i.e., those of the form p(x; θ) = exp(a(θ)b(x) + c(θ) + d(x)), the transformation
to the "natural parameter" φ = −a(θ) and new data variable z = b(x)
leads to some mathematical simplifications. The natural parameter for the
binomial distribution is the logit function,

$$ \phi = \ln(p/(1-p)), \tag{23} $$

also known as the log odds ratio; it is a convenient map from (0,1) to (−∞, ∞)
in a variety of contexts. Such non-linear maps are a reminder that the concept
of "shortest" is metric-dependent: it is easy to find pairs of intervals whose
relative length in p is reversed when transformed to φ.

The logit makes the mathematics simpler, but as Cox and Hinkley note,
whether this is really the best parametrization can depend on other considerations
as well. Models involving the logit and its inverse have a long history
and were used in the work that was awarded the 2000 Nobel Memorial Prize
in Economics. In any case, one can apply the same sort of Gaussian approximation
to the logit φ as is applied in forming the Wald intervals for p. The
maximum likelihood estimate φ̂ is obtained by plugging in p̂. The variance of φ̂
is estimated as ntot/(non(ntot − non)) = 1/(ntot p̂(1 − p̂)). One then has an interval
for φ, which can be mapped into an interval for p. As the formulas are singular
for non = 0 and non = ntot, patches are required, which are sometimes used for
other values of non as well, in particular adding 1/2 to both numerator and
denominator in the logit formula [2,15,16].
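
A minimal sketch of the Wald-type interval in the logit is given below. The function names are assumptions introduced for illustration, and the singular cases non = 0 and non = ntot are deliberately left unpatched (an error is raised), since the choice of patch is a separate decision.

```python
import math
from statistics import NormalDist

def z_value(conf_level):
    """Z_{alpha/2} for a two-sided confidence level 1 - alpha."""
    return NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)

def logit_interval(n_on, n_tot, conf_level=0.6827):
    """Gaussian interval for the logit, mapped back to an interval for p."""
    if n_on == 0 or n_on == n_tot:
        raise ValueError("logit formulas are singular for n_on = 0 or n_on = n_tot")
    n_off = n_tot - n_on
    phi_hat = math.log(n_on / n_off)            # logit of p_hat
    sigma = math.sqrt(n_tot / (n_on * n_off))   # estimated rms of phi_hat
    z = z_value(conf_level)

    def to_p(phi):
        # inverse of the logit map, Eqn. 23
        return 1.0 / (1.0 + math.exp(-phi))

    return to_p(phi_hat - z * sigma), to_p(phi_hat + z * sigma)

print(logit_interval(3, 10))
```

Because the interval is symmetric in φ rather than in p, the endpoints in p always lie strictly inside (0, 1).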
3.2 Neyman's Construction: "Exact" Inversion of a Hypothesis Test

Given any test statistic and an ordering defined for it, confidence intervals 
with minimum guaranteed coverage can be constructed by the technique of 
Neyman [17], which corresponds to inverting a hypothesis test with rigorously
calculated significance level. Such methods are often called "exact" since approximations
are not made in the calculation of the probabilities, but as already
shown for Clopper-Pearson, the coverage is by no means "exactly" equal
to the nominal C.L.! Analogous to the Neyman construction described in detail
for a similar discrete problem in Ref. [7], for each value of p one forms
acceptance intervals by adding the probabilities Bi(non | ntot, p) for observed
non until the threshold 1 − α is crossed. An auxiliary principle for the ordering
in which the probabilities for the non are to be added to the acceptance set
must be specified.

Clopper and Pearson [1] constructed central confidence intervals which remain
the standard [18] for those who insist (as has been common in HEP) that
coverage is always rigorously respected; the ordering is performed separately
on each end of the acceptance interval. As noted above, the cost is severe
over-coverage for some values of p. Angus and Schafer [19] compute over-coverage
of C-P intervals, pointing out that (1 − α) C.L. intervals can have
coverage probability as high as (1 − α/2) for some values of the true p; in
fact the coverage is always this high or larger if ntot is small enough that
ntot < (1 − ln α / ln 2).
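
The central C-P construction and the resulting coverage can be sketched with bisection on the binomial tail probabilities. The function names are assumptions made for illustration; the endpoints are obtained numerically rather than from beta-distribution quantiles, and the lower (upper) endpoint is set to 0 (1) when non = 0 (non = ntot).

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def upper_tail(k, n, p):
    """P(N >= k | p); increasing in p for k >= 1."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))

def solve_increasing(func, target, n_iter=60):
    """Bisection on [0, 1] for func(p) = target, with func increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(n_on, n_tot, conf_level=0.6827):
    """Central interval: each tail probability equals alpha/2 at its endpoint."""
    alpha = 1.0 - conf_level
    # Lower endpoint: P(N >= n_on | p) = alpha/2.
    low = 0.0 if n_on == 0 else solve_increasing(
        lambda p: upper_tail(n_on, n_tot, p), alpha / 2.0)
    # Upper endpoint: P(N <= n_on | p) = alpha/2,
    # i.e. P(N >= n_on + 1 | p) = 1 - alpha/2.
    up = 1.0 if n_on == n_tot else solve_increasing(
        lambda p: upper_tail(n_on + 1, n_tot, p), 1.0 - alpha / 2.0)
    return low, up

def coverage(p, n_tot, conf_level=0.6827):
    """Probability, for true p, that the reported interval contains p."""
    total = 0.0
    for n_on in range(n_tot + 1):
        low, up = clopper_pearson(n_on, n_tot, conf_level)
        if low <= p <= up:
            total += binomial_pmf(n_on, n_tot, p)
    return total

print(clopper_pearson(3, 10))
print(coverage(0.1, 10))
```

Scanning `coverage` over p for fixed ntot reproduces the saw-tooth behavior discussed in the Introduction, with no value below the nominal level.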

Sterne [20], followed soon by Crow [21], constructed sets of non-central intervals
with guaranteed minimum coverage. The idea is to reduce over-coverage
due to discreteness by relaxing the requirement in Eqn. 14 while retaining
that in Eqn. 13. An obvious ordering principle to start with is based on
Bi(non | ntot, p), i.e., adding points to (either end of) the acceptance interval
in decreasing order of probability so as to minimize the length of the acceptance
interval. There is room for adjustment, however, since in many cases the
acceptance interval can be shifted, keeping its length fixed while still maintaining
coverage. As there is considerable ambiguity in the best way to make such
adjustments, there have been numerous attempts to improve upon Sterne's
non-central intervals, variously referred to as two-tailed or both-tailed intervals.

Blyth and Still [22] give a very detailed discussion of the ambiguities encoun- 
tered in such both-tailed constructions. They list some desirable features of 
intervals and, while giving their preferences for resolving ambiguities, note 
that "We see no way of combining these desirable properties into a precise 
criterion that would be generally accepted." Casella [23] reviews Blyth and
Still and their predecessors and describes a method for systematically further
reducing the length of confidence intervals obtained from such constructions:
"... move all the lower endpoints of the intervals as far to the right as possible."
In commenting [24] on Brown et al., he strongly advocates covering
at the nominal value or greater, preferring the Blyth-Still intervals with his
length-reduction algorithm. Blaker [25,26] discusses in enlightening detail various
both-tailed methods, arriving at intervals which have good properties
(nesting) when viewed as a function of the confidence level. But Vos and
Hudson [27] explain in detail how both-tailed tests, even those of Blaker [25],
inevitably have some undesirable behavior due to discreteness.

In HEP, Feldman and Cousins [7] popularized a Neyman construction that 
is equivalent to inverting the hypothesis test based on likelihood ratios. The 
likelihood-ratio ordering in the Neyman construction is based not on Bi(non | ntot, p),
as used by Sterne, but on the likelihood ratio Bi(non | ntot, p)/Bi(non | ntot, p̂).
The corresponding test, the likelihood ratio test (LR test), is one of the standard
methods in classical statistics [6]. Coverage of both-tailed intervals for p
from such "exact inversion of the LR test" was illustrated by statisticians Corcoran
and Mehta [28], who prefer over-coverage to under-coverage, and who
advocate either these intervals or the Blyth-Still-Casella intervals. Ranucci [29]
compares coverage plots of intervals for p from such likelihood-ratio ordering
with the intervals of C-P and of Sterne. As mentioned above (and described
in Sec. IV of Ref. [7]), some interior points can be absent from the "interval"
after first inverting the LR test; if so, in the present paper they are added in
order to make the interval simply connected.
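
A grid-based sketch of the Neyman construction with likelihood-ratio ordering is given below. It is an approximation under stated assumptions: the function names are hypothetical, p is scanned on a finite grid (so endpoints are accurate only to the grid spacing), ties in the ordering are broken arbitrarily, and interior gaps are filled by reporting the smallest and largest accepted p.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def lr_acceptance_set(p, n_tot, conf_level):
    """Values of n accepted at this p, added in decreasing likelihood ratio."""
    probs = [binomial_pmf(n, n_tot, p) for n in range(n_tot + 1)]
    # Ratio of the probability at p to the probability at p_hat = n / n_tot.
    ratios = [probs[n] / binomial_pmf(n, n_tot, n / n_tot)
              for n in range(n_tot + 1)]
    order = sorted(range(n_tot + 1), key=lambda n: ratios[n], reverse=True)
    accepted, total = set(), 0.0
    for n in order:
        accepted.add(n)
        total += probs[n]
        if total >= conf_level:
            break
    return accepted

def lr_interval(n_on, n_tot, conf_level=0.6827, n_grid=1000):
    """Smallest and largest grid value of p whose acceptance set holds n_on."""
    grid = {i / n_grid for i in range(n_grid + 1)}
    grid.add(n_on / n_tot)   # p_hat always accepts n_on, so the list is non-empty
    accepted_p = [p for p in sorted(grid)
                  if n_on in lr_acceptance_set(p, n_tot, conf_level)]
    return accepted_p[0], accepted_p[-1]

print(lr_interval(3, 10))
```

Replacing the sort key by the probabilities themselves would give a Sterne-type ordering instead of the likelihood-ratio ordering.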



3.2.1 Randomization, Mid-P, and Continuity Correction 

In order to remove the over-coverage in Neyman constructions caused by the
discreteness of the integer-valued observations such as that of C-P, in 1950
Stevens [30] and others suggested adding a random number uniform on (0,1)
to the observed integer, and performing the construction on the resultant continuous
variable. As discussed in detail in Ref. [6], this technique, known as
randomization, results in shorter intervals and perfect coverage. But as this
extra random number was to be chosen from a table of (uniform) random numbers,
it is rarely if ever used except in theoretical discussions. Reference [31]
discusses how more meaningful data-based uniform variates can be justifiably
used in randomization of Poisson observations, but we do not pursue this
approach in this paper.

As an alternative to randomization, Lancaster [32] suggested in 1961 to deal
with the discreteness issue in many distributions by quoting an intermediate
value of the tail probability, since known as the "mid-P" value. For a one-sided
test, it is the null probability of more extreme results plus (only) half the probability
of the observed non. It corresponds to randomization always with the
addition of 1/2 to the observed integer successes rather than addition of a
uniform variate on (0,1). By using only half the probability rather than all the
probability of the observed non, the mid-P value is smaller than the standard
tail probability. As such, it has neither perfect coverage nor a guarantee against
undercoverage, but the mid-P has attracted much more of a following than
randomization, as the result is not influenced by an arbitrary random number.
Berry and Armitage [33] review mid-P intervals in various contexts including
the binomial problem, suggesting that they can be appropriate when combining
results from several studies. Agresti and Gottard [34,35] further advocate mid-P
intervals, provide a useful overview, and provide a function for computing them
in the statistical package R.
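
A central mid-P interval can be sketched by applying Lancaster's modification to each tail of the C-P construction. This is an illustrative sketch, not the R function mentioned above: the function names are assumptions, endpoints are found by bisection, and the lower (upper) endpoint is set to 0 (1) when non = 0 (non = ntot).

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def mid_p_upper_tail(n_on, n_tot, p):
    """P(N > n_on | p) plus half of P(N = n_on | p); increasing in p."""
    above = sum(binomial_pmf(j, n_tot, p) for j in range(n_on + 1, n_tot + 1))
    return above + 0.5 * binomial_pmf(n_on, n_tot, p)

def solve_increasing(func, target, n_iter=60):
    """Bisection on [0, 1] for func(p) = target, with func increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mid_p_interval(n_on, n_tot, conf_level=0.6827):
    """Central mid-P interval with mid-P tail probability alpha/2 on each side."""
    alpha = 1.0 - conf_level
    tail = lambda p: mid_p_upper_tail(n_on, n_tot, p)
    # Lower endpoint: upper-tail mid-P equals alpha/2.
    low = 0.0 if n_on == 0 else solve_increasing(tail, alpha / 2.0)
    # Upper endpoint: lower-tail mid-P equals alpha/2; the two mid-P tails
    # sum to one, so this is upper-tail mid-P = 1 - alpha/2.
    up = 1.0 if n_on == n_tot else solve_increasing(tail, 1.0 - alpha / 2.0)
    return low, up

print(mid_p_interval(3, 10))
```

Since only half of the probability of the observed non enters each tail, these intervals are shorter than the corresponding C-P intervals and can undercover.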

Another commonly used device in dealing with discrete distributions is called
(somewhat optimistically) a continuity correction, for example adding or subtracting
1/2 (or more generally another constant) from the observed number
of successes. Although there is some advocacy of continuity corrections with
respect to the binomial problem in the literature, it appears that there are
better-performing ways to deal with the discreteness [2,36].



3.3 Bayesian-inspired methods 



Intervals derived using Bayesian machinery [37] can be evaluated according
to their frequentist coverage properties, and there has long been interest in
prior probability density functions ("priors") which lead to Bayesian credible
intervals possessing approximate frequentist coverage. Recent reviews of such
"probability matching priors" are in Refs. [38,39]. Since the work of Welch and
Peers [40,41,42], it has been recognized that Jeffreys's prior [43,44] (derived by
Jeffreys under a different motivation) is the lowest-order probability matching
prior for one parameter (although care must be taken in interpreting this
result for a discrete distribution such as binomial). The Jeffreys prior for the
binomial problem is

$$ \rho(p) \propto \frac{1}{\sqrt{p\,(1-p)}}, \tag{24} $$

which is a special case (with a = b = 1/2) of the two-parameter beta distribution,
which has pdf

$$ \rho(p; a, b) \propto p^{\,a-1}\,(1-p)^{\,b-1}. \tag{25} $$

The beta distribution is closely linked to the binomial distribution [37], and
varying a and b provides a family of priors (including the uniform prior with
a = b = 1). The posterior from a beta prior is also a beta distribution [37,45],
and intervals can be obtained from it using various criteria such as length or
centrality.
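
As an illustration, an equal-tailed interval from the Jeffreys prior can be sketched with the standard library alone. With non successes in ntot trials, the posterior is a beta distribution with a = non + 1/2 and b = ntot − non + 1/2. The function names, the substitution p = sin²(angle) used to remove the endpoint singularities, and the Simpson-rule integration are all assumptions made for this sketch; a library beta quantile function would normally be used instead, and no special treatment is applied at non = 0 or non = ntot.

```python
import math

def _integral(upper_angle, pow_sin, pow_cos, n_steps=2000):
    """Simpson integral of 2 sin^pow_sin cos^pow_cos over [0, upper_angle].

    With p = sin(angle)^2 this equals the integral of
    p^(a-1) (1-p)^(b-1) dp for pow_sin = 2a - 1 and pow_cos = 2b - 1.
    """
    h = upper_angle / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        weight = 1 if i in (0, n_steps) else (4 if i % 2 else 2)
        angle = i * h
        total += weight * math.sin(angle) ** pow_sin * math.cos(angle) ** pow_cos
    return 2.0 * total * h / 3.0

def posterior_cdf(p, n_on, n_tot):
    """Posterior probability below p for Beta(n_on + 1/2, n_tot - n_on + 1/2)."""
    pow_sin, pow_cos = 2 * n_on, 2 * (n_tot - n_on)
    norm = _integral(math.pi / 2.0, pow_sin, pow_cos)
    return _integral(math.asin(math.sqrt(p)), pow_sin, pow_cos) / norm

def jeffreys_interval(n_on, n_tot, conf_level=0.6827, n_iter=50):
    """Equal-tailed interval holding conf_level posterior probability."""
    alpha = 1.0 - conf_level
    def quantile(q):
        lo, hi = 0.0, 1.0
        for _ in range(n_iter):
            mid = 0.5 * (lo + hi)
            if posterior_cdf(mid, n_on, n_tot) < q:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return quantile(alpha / 2.0), quantile(1.0 - alpha / 2.0)

print(jeffreys_interval(3, 10))
```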

Geisser [46] considers several noninformative priors in the Bayesian literature
and advocates a prior uniform in p, rejecting the Jeffreys prior because it violates
the (strong) likelihood principle [37]. The Comments following Ref. [46]
(by Bernardo, Novick, and Zellner) point out problems when p is near 0 or
1. Brenner and Quan [47] also advocate a prior uniform in p, apparently unaware
of the many issues [44,46] in trying to represent "no prior information"
in a prior. Copas [48] emphasizes that Bayesian-derived results, such as those
of Brenner and Quan, do not automatically have good frequentist properties,
and in particular criticizes the prior uniform in p.

Rubin and Schenker [49] derive logit-based intervals using the Jeffreys prior,
recalling earlier work (including Gart [15]) connecting this approach to using
asymptotic logit estimation after appending a half success and a half failure 
as mentioned above. They calculate coverage both for fixed values of p and 
for values averaged over the Jeffreys prior. 

3.4 Comparative studies 

Given the abundance of methods, a number of authors have compared them 
by various criteria such as average coverage or average length (both of which 
are metric dependent), behavior near the extrema of p, etc. There is no general 
agreement on even basic features, such as whether or not coverage should be 
respected everywhere or in an average sense. And as noted above, preferences 
can differ if one is concerned only with one-sided intervals. 

Reiczigel [50] advocates quoting an adjusted significance level based on cal- 
culated coverage rather than the nominal coverage used in the construction. 
Agresti [51] advocates inverting two-tailed tests (leading to non-central intervals)
rather than two one-tailed tests. Puza and O'Neill [52] perform coverage
studies and advocate a "new class" of C-P-inspired intervals which transition 
from one-sided to two-sided intervals. 

Vollset [53] reviews in detail thirteen methods, recommending a continuity-corrected
Wilson score method (strongly disfavored by Refs. [2,36]), but describing
as "safe" the C-P intervals, mid-P intervals, and Wilson score intervals
without the continuity correction. He finds likelihood-ratio intervals to be too 
narrow for boundary outcomes. 

Edwardes [54] compared several methods using coverage averaged over a chosen 
metric, studying the behavior as a function of the constant used in the 
continuity correction. Among many results, he finds good performance for a 
Wald logit interval with negative continuity correction. 

Newcombe [55] considers the "strict conservatism" of the C-P method to be 
"unnecessarily conservative" and compares it to the Wald and score methods, 
with and without continuity corrections, mid-P, and $\Delta(-2\ln\mathcal{L})$ methods, and 
appears to favor the mid-P and score methods. 

Lu [56] compares lengths of intervals made with Bayesian methods with beta 
function priors, with endpoints adjusted according to Blyth [57]. He discusses 
in some detail the beta-distribution formulas and their numerical evaluation. 

Agresti and Min [58] generally prefer both-tailed (non-central) tests if coverage 
is strictly required (unless one specifically requires a one-sided test), and 
recommend mid-P tests if not. They also discuss using unconditional coverage 
in eliminating nuisance parameters in the context of the difference in binomial 
parameters. 

Pires and Amado [59] compare 20 methods (counting various continuity correc- 
tions), with a table giving formulas for all of them. They prefer C-P intervals 
if coverage is strictly required, or the (ungeneralized) Agresti-Coull "add 4" 
method [10] if not. 

Brown et al. [2] consider the Clopper-Pearson intervals to be "wastefully 
conservative", and advocate three sets of intervals with coverage oscillating about 
the nominal value: the Wilson score interval, the Agresti-Coull interval, and 
a Bayesian interval with Jeffreys prior and equal tails (except when $n_{\rm on}$ is at 
the extreme values, in which case they take one tail). 

Coverage plots are illustrated in Refs. [^ll28l[2M35ll4^^ 



4 Application to the Ratio of Poisson Means 



As discussed in Sec. 2, intervals for the ratio of Poisson means $\lambda$ are readily 
obtained from intervals for the binomial parameter p, and vice versa. The 
conditional coverage of $\lambda$, given $n_{\rm tot}$, can be read off coverage plots for p using 
$\lambda = (1 - p)/p$. However, we can also consider the unconditional coverage of 
$\lambda$ as a function of the two unknown means $\mu_{\rm on}$ and $\mu_{\rm off}$, as follows. 
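
As a minimal illustration of the translation between the two problems, the sketch below maps an interval for p into an interval for $\lambda$ using the relation $\lambda = (1 - p)/p$ quoted above. The function names are hypothetical and chosen only for this example. Because $\lambda$ decreases monotonically as p increases, the upper endpoint for p becomes the lower endpoint for $\lambda$, and vice versa.

```python
def p_interval_to_lambda(p_low, p_up):
    """Map an interval [p_low, p_up] for p into one for lambda = (1 - p)/p.

    lambda is a decreasing function of p, so the endpoints swap roles.
    An endpoint p = 0 corresponds to lambda = infinity.
    """
    lam_low = (1.0 - p_up) / p_up if p_up > 0.0 else float("inf")
    lam_up = (1.0 - p_low) / p_low if p_low > 0.0 else float("inf")
    return lam_low, lam_up


def lambda_to_p(lam):
    """Inverse relation p = 1/(1 + lambda)."""
    return 1.0 / (1.0 + lam)


print(p_interval_to_lambda(0.25, 0.5))
```

The same monotonic map means that p lies inside its interval exactly when $\lambda$ lies inside the translated interval, so conditional coverage is unchanged by the translation.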

Given fixed $\mu_{\rm on}$, $\mu_{\rm off}$ (and hence $\lambda$), a C.L., and a recipe for intervals, then 
for all pairs $(n_{\rm on}, n_{\rm off})$, one can calculate both the confidence interval for $\lambda$ 
($[t_{\rm low}, t_{\rm up}]$) and the probability $P(n_{\rm on}, n_{\rm off})$ of obtaining that pair (Eqn. 10). 
From these one can calculate probabilities that $\lambda < t_{\rm low}$, that $\lambda \in [t_{\rm low}, t_{\rm up}]$ and 
that $\lambda > t_{\rm up}$. Figures 4a and b have the results of such unconditional coverage 
calculations for the Clopper-Pearson and Jeffreys-prior-based recipes used in the 
previous figures; Figs. 5a and b contain the corresponding plots for 95% C.L. In order 
to facilitate comparison with the conditional coverage plots, the axes are $\mu_{\rm tot}$ 
and p, from which one can make the translation to $(\mu_{\rm on}, \mu_{\rm off})$ via $\mu_{\rm on} = p\,\mu_{\rm tot}$ 
and $\mu_{\rm off} = (1 - p)\,\mu_{\rm tot}$. 

Fig. 4. Unconditional coverage of (a) 68.27% C.L. Clopper-Pearson intervals for the 
ratio of Poisson means $\lambda$ and (b) intervals for $\lambda$ calculated using a Bayesian method 
with Jeffreys prior and containing 68.27% posterior probability. As described in the 
text, the coverage as a function of $(\mu_{\rm on}, \mu_{\rm off})$ is displayed equivalently as a function 
of $(p, \mu_{\rm tot})$. $\Delta$CL is the difference between the actual coverage and the nominal 
coverage, 68.27%. 

Fig. 5. Unconditional coverage of (a) 95% C.L. Clopper-Pearson intervals for $\lambda$ 
and (b) intervals for $\lambda$ calculated using a Bayesian method with Jeffreys prior and 
containing 95% posterior probability. 
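
The unconditional coverage calculation described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used for the figures: it uses the standard factorization of the joint Poisson probability into a Poisson probability for $n_{\rm tot}$ times a binomial probability for $n_{\rm on}$, finds Clopper-Pearson endpoints by simple bisection, truncates the sum over $n_{\rm tot}$ at a hypothetical cutoff `n_max`, and counts the outcome $n_{\rm tot} = 0$ as covering. All function names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability Bi(k | n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)


def poisson_pmf(n, mu):
    """Poisson probability of n counts for mean mu > 0."""
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Bi(n, p); increases with p."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


def bisect(f, lo=0.0, hi=1.0, iters=50):
    """Root of f on [lo, hi], assuming f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(k, n, cl):
    """Central Clopper-Pearson interval for p, with (1 - cl)/2 in each tail."""
    half_alpha = 0.5 * (1.0 - cl)
    p_low = 0.0 if k == 0 else bisect(
        lambda p: upper_tail(k, n, p) - half_alpha)
    p_up = 1.0 if k == n else bisect(
        lambda p: upper_tail(k + 1, n, p) - (1.0 - half_alpha))
    return p_low, p_up


def unconditional_coverage(mu_on, mu_off, cl):
    """Probability that the interval covers lambda = mu_off/mu_on."""
    mu_tot = mu_on + mu_off
    p = mu_on / mu_tot  # so that lambda = (1 - p)/p
    n_max = int(mu_tot + 8.0 * math.sqrt(mu_tot) + 10.0)  # cutoff (assumption)
    covered = math.exp(-mu_tot)  # n_tot = 0: interval (0, inf) always covers
    for n_tot in range(1, n_max + 1):
        w_tot = poisson_pmf(n_tot, mu_tot)
        for n_on in range(n_tot + 1):
            p_low, p_up = clopper_pearson(n_on, n_tot, cl)
            if p_low <= p <= p_up:  # same as lambda inside its interval
                covered += w_tot * binom_pmf(n_on, n_tot, p)
    return covered


print(unconditional_coverage(1.0, 2.0, 0.6827))
```

Swapping `clopper_pearson` for another recipe with the same signature gives the corresponding coverage for that recipe.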

These plots of unconditional coverage thus average over observed $n_{\rm tot}$ given 
true values of $\mu_{\rm on}$ and $\mu_{\rm off}$, in contrast to nearly all previous studies which 
average over unknown true values of parameters. While the use of the unconditional 
ensemble (rather than the restricted "conditional" ensemble having 
the observed $n_{\rm tot}$) goes against the mainstream statistical practice of using the 
conditional ensemble in a case such as this [5], we believe that this frequentist 
averaging over data provides at least as good a way to average out some discreteness 
effects as does the common averaging over p, which requires a choice 
of metric (often p itself, although one can argue that the metric in which the 
prior is uniform is the natural metric in a Bayesian calculation). The issue is 
discussed in detail in Ref. [60], which as mentioned below describes a construction 
of central confidence intervals for the ratio of Poisson means having strict 
unconditional, but not conditional, coverage. (Averaging over observed data 
with different values of $n_{\rm tot}$ was used in Ref. [60] to cancel out some under- 
and over-coverage at different values of $n_{\rm tot}$ at each value of the ratio.) 

As apparent from Fig. 4a, applying Clopper-Pearson binomial confidence intervals 
to the ratio-of-Poisson-means problem further propagates the over-coverage 
due to the discreteness. We return to this important point in Sec. 5 
below after first briefly reviewing some previous work applying non-C-P binomial 
intervals to the ratio-of-Poisson-means problem. 

Price and Bonett [16] consider various solutions to the problem of the ratio of 
Poisson means from a broad point of view, including translating into this problem 
the binomial confidence intervals of C-P [1], Wilson score [9], and Agresti 
and Coull [10]. They also consider recipes derived directly for the ratio problem, 
namely a square-root transformation, an adjusted Wald log-linear model 
equivalent to the adjusted Wald logit formula mentioned above, and Bayesian 
methods including a Gamma prior for the ratio. Their conclusions depend as 
usual on considerations such as whether coverage is rigorously required, but 
tend to favor the adjusted Wald log-linear model in which 0.5 is added to the 
observed counts, resulting in endpoints 



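$$\frac{n_{\rm on} + 0.5}{n_{\rm tot} - n_{\rm on} + 0.5}\, \exp\left( \pm Z_{\alpha/2} \sqrt{\frac{1}{n_{\rm on} + 0.5} + \frac{1}{n_{\rm tot} - n_{\rm on} + 0.5}} \right). \qquad (26)$$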
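
For concreteness, the endpoints of Eqn. (26) can be evaluated as in the sketch below. The function name is hypothetical, and we assume here that Z is the two-sided standard-normal quantile for the requested C.L. (approximately 1 for 68.27%). The prefactor is taken exactly as written in Eqn. (26), i.e., the adjusted ratio of on to off counts; the endpoints for the reciprocal ratio follow by inverting and swapping them.

```python
import math
from statistics import NormalDist


def adjusted_wald_loglinear(n_on, n_tot, cl=0.6827):
    """Endpoints of Eqn. (26): 0.5 is added to each observed count."""
    z = NormalDist().inv_cdf(0.5 + 0.5 * cl)  # two-sided quantile (assumption)
    a = n_on + 0.5
    b = n_tot - n_on + 0.5
    centre = a / b
    half_width = z * math.sqrt(1.0 / a + 1.0 / b)  # on the log scale
    return centre * math.exp(-half_width), centre * math.exp(half_width)


print(adjusted_wald_loglinear(3, 10))
```

Because the interval is symmetric on the log scale, the product of the two endpoints equals the square of the adjusted point estimate, and the endpoints remain finite and positive even when one of the counts is zero.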



Tang and Ng [61], in commenting on a paper by Graham et al. [62], examine 
several methods for intervals for the ratio of Poisson means, including several 
based on binomial methods. They prefer the adjusted Wald logit method also 
favored by Price and Bonett, citing them as the source. 

Barker and Caldwell [63] compare results of eight methods for 95% C.L. intervals, 
including Bayesian with uniform and Jeffreys prior. They prefer the 
Wald log-linear method but do not mention the adjustment of adding 0.5 
(nor do they cite Price and Bonett); they use instead the C-P interval when 
$\min(n_{\rm on}, n_{\rm tot} - n_{\rm on}) = 0$. They found that this composite set maintains coverage 
(even though its theoretical justification relies on asymptotic approximation) 
and generally performs better than other methods which maintain coverage. 
If some under-coverage is allowed, they favor Bayesian with uniform prior and 
the Wilson score interval. (Their criteria include length in the metric uniform 
in p.) 

Gu et al. [64] compare four general approaches via Monte Carlo simulation, 
restricting themselves to the one-sided test. They prefer a test based on a 
variance-stabilizing transformation of Huffman [65], using an idea of Anscombe 
[66]. A likelihood ratio test is most powerful against the alternatives they 
considered, but as it can undercover they advise caution in its use. 



Cousins [60] describes his multi-dimensional Neyman construction to obtain 
a set of central confidence intervals for the ratio of Poisson means, obtaining 
intervals that are strictly conservative in unconditional coverage, and which are 
always subsets (proper subsets except for $n_{\rm tot} = 1$) of Clopper-Pearson-derived 
intervals. The unconditional coverage is obtained by averaging over conditional 
under-coverage and over-coverage at different values of $n_{\rm tot}$. When translated 
back into confidence intervals for a binomial parameter, these intervals are 
remarkably similar to mid-P intervals, as discussed below. 



In performing coverage tests, there is an issue of what to do if the data obtained 
have both $n_{\rm off}$ and $n_{\rm on}$ equal to zero, so that $n_{\rm tot} = 0$. As nothing has 
been learned about the ratio, the only sensible confidence interval is $(0, \infty)$, 
which always covers the unknown true value. Cousins argues in Ref. [60] that 
such experiments should be excluded from the coverage calculation, since as 
a practical matter, "An experimenter who observes neither Poisson process 
will normally make no statement regarding the ratio of their means! Thus, 
practical confidence intervals should have the property that the requisite coverage 
is obtained when one considers only those experimenters who do not 
obtain (0,0)." We still believe this to be the case, but note that the coverage 
calculations of Refs. [16,61,63] include observed data (0,0). 
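
The two conventions are related by simple arithmetic, sketched here in notation introduced only for this illustration: $C_{\rm incl}$ denotes the unconditional coverage when the outcome (0,0) is included and counted as covering, and $C_{\rm excl}$ the coverage in the ensemble that excludes it. Since (0,0) occurs with Poisson probability $e^{-\mu_{\rm tot}}$ and its interval always covers, one has:

```latex
% C_incl: coverage including (0,0), whose interval (0, infinity) always covers
% C_excl: coverage among experiments with n_tot >= 1
\begin{align}
  C_{\rm incl} &= e^{-\mu_{\rm tot}} \cdot 1
                + \left(1 - e^{-\mu_{\rm tot}}\right) C_{\rm excl}, \\
  C_{\rm excl} &= \frac{C_{\rm incl} - e^{-\mu_{\rm tot}}}
                       {1 - e^{-\mu_{\rm tot}}} .
\end{align}
```

Including (0,0) therefore can only raise the quoted coverage, and the difference matters mainly at small $\mu_{\rm tot}$: for $\mu_{\rm tot} = 5$ the weight $e^{-\mu_{\rm tot}}$ is already below 0.01.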



In the remainder of this paper, we discuss in more detail how the ratio-of- 
Poisson-means problem allows one to evaluate binomial intervals using a fre- 
quentist average over data; we find this to be preferable to averaging over the 
binomial parameter, which requires a choice of metric (or a choice of Bayesian 
prior from which a corresponding natural metric can be inferred). We then 
perform this evaluation for a number of the available sets of intervals, and 
conclude with observations and recommendations. 

5 Frequentist evaluation of the performance of the various recipes 



Given $n_{\rm tot}$, p, and a C.L., one can use any of the above recipes to obtain the 
set of $n_{\rm tot} + 1$ intervals $[t_{\rm low}, t_{\rm up}]$ for p corresponding to the possible observations. 
As described and illustrated above, in the frequentist evaluation of 
these intervals, one calculates the probability ${\rm Bi}(n_{\rm on} | n_{\rm tot}, p)$ of obtaining each 
interval, and thus the probabilities that $p < t_{\rm low}$, that $p \in [t_{\rm low}, t_{\rm up}]$ and that 
$p > t_{\rm up}$. For the ratio-of-Poisson-means problem, one is given $(n_{\rm on}, n_{\rm off})$ and 
a C.L., from which $n_{\rm tot} = n_{\rm on} + n_{\rm off}$ and hence an interval for p is calculated 
and then translated into an interval for $\lambda$ using Eqn. [Sl. For any given $\mu_{\rm tot}$ and 
$\lambda$, probabilities of obtaining $(n_{\rm on}, n_{\rm off})$ are calculated from Eqn. 12, and thus 
unconditional probabilities for covering $\lambda$ can be calculated from the obtained 
interval sets and these probabilities. 
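
The conditional (fixed $n_{\rm tot}$) part of this evaluation can be sketched as follows. The function `conditional_probs` is a hypothetical helper that accepts any recipe with the signature `recipe(n_on, n_tot, cl)`; the Wilson score interval is used here only as a convenient closed-form example of such a recipe, written in its standard textbook form.

```python
import math
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial probability Bi(k | n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)


def wilson(n_on, n_tot, cl):
    """Wilson score interval (no continuity correction)."""
    z = NormalDist().inv_cdf(0.5 + 0.5 * cl)
    p_hat = n_on / n_tot
    denom = 1.0 + z * z / n_tot
    centre = (p_hat + z * z / (2.0 * n_tot)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n_tot
                         + z * z / (4.0 * n_tot * n_tot)) / denom
    return centre - half, centre + half


def conditional_probs(recipe, n_tot, p, cl):
    """Return P(p < t_low), P(t_low <= p <= t_up), P(p > t_up) at fixed n_tot."""
    below = inside = above = 0.0
    for n_on in range(n_tot + 1):
        t_low, t_up = recipe(n_on, n_tot, cl)
        weight = binom_pmf(n_on, n_tot, p)  # probability of this interval
        if p < t_low:
            below += weight
        elif p > t_up:
            above += weight
        else:
            inside += weight
    return below, inside, above


print(conditional_probs(wilson, 10, 0.3, 0.6827))
```

The middle number is the conditional coverage at the chosen p and $n_{\rm tot}$; scanning p traces out one horizontal slice of a coverage plot, and the outer two numbers give the corresponding one-sided probabilities.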



In Figs. 6 through 19, we plot the probability that the parameter is in the 
interval for methods which are among those advocated in the above references: 
Clopper-Pearson with mid-P modification, at 68.27% C.L. (Figs. 6a and b, 
17a) and at 95% C.L. (Figs. 7a and b); Wilson score at 68.27% C.L. (Figs. 8a 
and b); generalized Agresti-Coull at 68.27% C.L. (Figs. 9a and b); Wald 
log-linear at 68.27% C.L. (Figs. 10a and b); $\Delta(-2\ln\mathcal{L})$ at 68.27% C.L. (Figs. 11a 
and b); exact LR test inversion, i.e., Neyman construction with likelihood-ratio 
ordering, with and without mid-P modification, at both C.L.'s (Figs. 12 
through 16); Cousins's [60] ratio-of-Poisson-means intervals translated into binomial 
at 68.27% C.L. (Figs. 17b, 18a and b); and Bayesian with prior uniform in 
p at 68.27% C.L. (Figs. 19a and b). Additional plots for nearly all methods 
mentioned in this paper, for a variety of confidence levels, slices, as well as for 
one-sided probabilities, are available on request from the authors. 

A striking aspect of the two-dimensional plots is the variation of coverage, 
which is difficult to capture in tables of average coverage or rms of coverage: 
superimposed on the oscillations are evident trends which indicate regions of 
particularly low or high coverage. A number of methods give large undercoverage 
either at low $n_{\rm tot}$ or at p near the endpoints; as these values naturally 
arise in HEP applications, we disfavor such methods. 

Another significant observation is how well the mid-P methods perform in the 
unconditional coverage calculations for the ratio of Poisson means. In effect the 
Poisson fluctuations of $n_{\rm tot}$ are performing some randomization on top of the 
"mid" value (0.5) which was fixed in the mid-P calculation for fixed $n_{\rm tot}$. For 
central intervals, the result is a remarkable resemblance to the corresponding 
plots for the central intervals of Cousins [60], which are strictly conservative for 
the ratio of Poisson means, but much less so than Clopper-Pearson intervals. 
This similarity was discovered while performing the calculations for this paper, 
and is seen in Figs. 17a and b; Figs. 6a and 18a; Figs. 6b and 18b; and in 
numerous other plots inspected by the authors. One could imagine further 
tuning (as a function of $n_{\rm tot}$) the "mid" value of 0.5 in order to optimize 
coverage, but we did not explore this. 

Fig. 6. (a) Coverage of 68.27% C.L. (Clopper-Pearson) mid-P intervals, as a function 
of p and $n_{\rm tot}$, and (b) unconditional coverage of the same intervals for $\lambda$. A 
horizontal slice of (a) is in Fig. 17a. 

Fig. 7. (a) Coverage of 95% C.L. (Clopper-Pearson) mid-P intervals, as a function 
of p and $n_{\rm tot}$, and (b) unconditional coverage of the same intervals for $\lambda$. 
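
To make the role of the "mid" value concrete, the following sketch computes central mid-P intervals by bisection. It assumes the usual form of Lancaster's modification, in which each tail probability includes only half of the probability of the observed $n_{\rm on}$; the function names are hypothetical, and the weight 0.5 appears explicitly so that the tuning mentioned above would amount to changing that single constant.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability Bi(k | n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)


def bisect(f, lo=0.0, hi=1.0, iters=50):
    """Root of f on [lo, hi], assuming f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mid_p_interval(k, n, cl):
    """Central interval for p with the mid-P modification of the tails."""
    half_alpha = 0.5 * (1.0 - cl)

    def upper_mid(p):
        # P(X > k) + 0.5 P(X = k); increases with p
        tail = sum(binom_pmf(j, n, p) for j in range(k + 1, n + 1))
        return tail + 0.5 * binom_pmf(k, n, p)

    def lower_mid(p):
        # P(X < k) + 0.5 P(X = k); decreases with p
        tail = sum(binom_pmf(j, n, p) for j in range(k))
        return tail + 0.5 * binom_pmf(k, n, p)

    # At k = 0 (k = n) the mid-P tail never drops below alpha/2,
    # so the interval extends to the boundary.
    p_low = 0.0 if k == 0 else bisect(lambda p: upper_mid(p) - half_alpha)
    p_up = 1.0 if k == n else bisect(lambda p: half_alpha - lower_mid(p))
    return p_low, p_up


print(mid_p_interval(3, 10, 0.6827))
```

Replacing the factor 0.5 by 1 in both helper functions recovers the Clopper-Pearson endpoints, which shows directly why the mid-P intervals are always contained in the Clopper-Pearson ones.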

The Bayesian-inspired methods perform reasonably well with the ad-hoc choice 
of using central intervals unless $n_{\rm on}$ is 0 or $n_{\rm tot}$, in which case the interval was 
pushed against the endpoint, as described above. This leads to over-coverage 
near the endpoints (a feature of many methods). We did not explore alternatives 
such as highest-posterior-density intervals. 

Among asymptotic methods, the Wilson score interval and the (preferred) 
generalized Agresti-Coull interval appear to be reasonable for quick estimates 
as various authors have advocated. The $\Delta(-2\ln\mathcal{L})$ method undercovers at 
low $n_{\rm tot}$, and is generally not advocated in the literature reviewed. 

Fig. 8. (a) Coverage of 68.27% C.L. Wilson score intervals, as a function of p and 
$n_{\rm tot}$, and (b) unconditional coverage of the same intervals for $\lambda$. 

Fig. 9. (a) Coverage of 68.27% C.L. generalized Agresti-Coull intervals, as a function 
of p and $n_{\rm tot}$, and (b) unconditional coverage of the same intervals for $\lambda$. 
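
As an example of such a quick estimate, the sketch below implements what we take to be the generalized Agresti-Coull recipe in its commonly quoted form: $z^2/2$ pseudo-successes and $z^2/2$ pseudo-failures are added before applying the Wald formula, which at 95% C.L. is close to the "add 4" rule. The function name and the clipping of the endpoints to [0, 1] are choices made for this illustration.

```python
import math
from statistics import NormalDist


def agresti_coull(n_on, n_tot, cl):
    """Generalized Agresti-Coull interval for p (commonly quoted form)."""
    z = NormalDist().inv_cdf(0.5 + 0.5 * cl)
    n_adj = n_tot + z * z                  # adjusted number of trials
    p_adj = (n_on + 0.5 * z * z) / n_adj   # adjusted point estimate
    half = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # Clipping to the physical range is an assumption of this sketch.
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


print(agresti_coull(3, 10, 0.6827))
```

The adjusted point estimate coincides with the midpoint of the Wilson score interval, which is one way to see why the two recipes behave similarly.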



6 Conclusion 

While intervals such as the Wilson score and the generalized Agresti-Coull 
can be useful for hand calculations and quick estimates (and are a dramatic 
improvement over the Wald intervals), the methods based on "exact" calculations 
(i.e., using the binomial and Poisson probabilities rather than asymptotic 
or Bayesian-inspired calculations) appear to give the most reliable frequentist 
coverage. When strictly conservative coverage is desired, this statement is a 
tautology, but it also appears to be the case when approximate coverage is 
desired, if (as we advocate) the average coverage is evaluated by averaging 
over data in the closely related ratio-of-Poisson-means problem, rather than 
attempting to average over p. 

Fig. 10. (a) Coverage of 68.27% C.L. Wald-log-linear intervals, as a function of p 
and $n_{\rm tot}$, and (b) unconditional coverage of the same intervals for $\lambda$. 

Fig. 11. (a) Coverage of 68.27% C.L. $\Delta(-2\ln\mathcal{L})$ intervals, as a function of p and $n_{\rm tot}$, 
and (b) unconditional coverage of the same intervals for $\lambda$. 

For central intervals, the original Clopper-Pearson intervals [1] remain the 
strictly conservative standard [ISj, at the cost of severe over-coverage, especially 
at small $n_{\rm tot}$. Among the many variants of strictly conservative both-tailed 
(non-central) intervals, we prefer those based on likelihood-ratio ordering, 
i.e., the intervals obtained by "exact inversion of the LR test", the method 
advocated in HEP by Feldman and Cousins [7]. The likelihood-ratio test [6] 
generalizes well to many complex, multi-dimensional problems in statistical 
inference [7], and thus is well-integrated into a larger picture; when using more 
specialized ad hoc manipulations applied to the binomial problem, one is faced 
with the problem of when to abandon them (and what to replace them with) 
as more complications are added to the original simple $(n_{\rm on}, n_{\rm tot})$ problem. 

Fig. 12. (a) Coverage of 68.27% C.L. intervals obtained from exact inversion of the 
LR test, as a function of p, for fixed $n_{\rm tot} = 10$, and (b) coverage of same intervals 
but with mid-P modification. (a) and (b) are horizontal slices of Figs. 13a and 15a, 
respectively. 

Fig. 13. (a) Coverage of 68.27% C.L. intervals obtained from exact inversion of the 
LR test, as a function of p and $n_{\rm tot}$, and (b) unconditional coverage of the same 
intervals for $\lambda$. A horizontal slice of (a) is in Fig. 12a. 
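
A minimal sketch of the Neyman construction with likelihood-ratio ordering for the binomial problem is given below. It is an illustration under simplifying assumptions, not the code used for the figures: p is scanned on a hypothetical uniform grid, tied likelihood ratios are not treated specially, and the reported interval is the smallest and largest accepted grid value (so any gaps in the accepted set are filled in). The function names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability Bi(k | n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)


def lr_acceptance(n_tot, p, cl):
    """Acceptance set of n_on values at a given p, using LR ordering."""
    def ratio(k):
        # Likelihood at p divided by the likelihood at the best fit k/n_tot
        return binom_pmf(k, n_tot, p) / binom_pmf(k, n_tot, k / n_tot)

    ranked = sorted(range(n_tot + 1), key=ratio, reverse=True)
    accepted, total = set(), 0.0
    for k in ranked:
        accepted.add(k)
        total += binom_pmf(k, n_tot, p)
        if total >= cl:  # stop once the requested probability is reached
            break
    return accepted


def lr_interval(n_on, n_tot, cl, grid=1000):
    """Smallest and largest grid values of p whose acceptance set has n_on."""
    inside = [i / grid for i in range(1, grid)
              if n_on in lr_acceptance(n_tot, i / grid, cl)]
    return min(inside), max(inside)


print(lr_interval(3, 10, 0.6827))
```

By construction each acceptance set contains at least the requested probability, so the resulting intervals do not undercover at the grid points; the mid-P variant discussed in the text would modify how the probability of the last-added value is counted.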

In the ratio-of-Poisson-means problem, we prefer making Lancaster's mid-P 
modification [32] to the construction of either set of exact intervals in the 
above paragraph. It provides remarkably good approximate coverage in the 
ratio-of-Poisson-means problem when evaluated in the unconditional ensemble 
(i.e., frequentist averaging over values of $n_{\rm tot}$ other than the value observed, 
weighted by their Poisson probabilities). The mid-P intervals are strikingly 
similar to a set constructed by Cousins which strictly covers the ratio, but the 
mid-P intervals have a much simpler description that can also be generalized 
as complexity is added to the problem. One can also imagine contexts (such 
as estimating efficiencies of many similar detector elements) in which Poisson 
fluctuations of the number of trials in each detector element provide a sort of 
frequentist ensemble which would suggest that mid-P intervals should be 
considered. However, use of mid-P intervals in a context in which there is no such 
frequentist averaging would go against the traditional conventions of HEP. 
Introduction of nuisance parameters (e.g., some systematic uncertainties) into 
the pure binomial problem, as is common in HEP, can provide another source 
of averaging. We speculate that mid-P intervals could prove to be useful for 
obtaining good coverage in many such contexts. 

Fig. 14. (a) Coverage of 95% C.L. intervals obtained from exact inversion of the 
LR test, as a function of p and $n_{\rm tot}$, and (b) unconditional coverage of the same 
intervals for $\lambda$. 

Fig. 15. (a) Coverage of 68.27% C.L. intervals obtained from exact inversion of the 
LR test with mid-P modification, as a function of p and $n_{\rm tot}$, and (b) unconditional 
coverage of the same intervals for $\lambda$. A horizontal slice of (a) is in Fig. 12b. 

Fig. 16. (a) Coverage of 95% C.L. intervals obtained from exact inversion of the 
LR test with mid-P modification, as a function of p and $n_{\rm tot}$, and (b) unconditional 
coverage of the same intervals for $\lambda$. 

Fig. 17. (a) Coverage of 68.27% C.L. (Clopper-Pearson) mid-P intervals, and (b) 
coverage of 68.27% C.L. intervals constructed by Cousins [60] for the ratio of Poisson 
means and translated here to intervals for p, both as a function of p, for fixed 
$n_{\rm tot} = 10$. (a) and (b) are horizontal slices of Figs. 6a and 18a, respectively. The 
remarkable resemblance is typical of that for other values of p and $n_{\rm tot}$. 

The use of these intervals can of course be considered in any application of 
binomial intervals. In high energy and astroparticle physics, the "on/off" (signal 
bin plus sideband) problem was recently explored in detail by Cousins, 
Linnemann, and Tucker [67]; one of the promising methods for computing the 
statistical significance of a signal (denoted by $Z_{\rm Bi}$) used the Clopper-Pearson 
interval. In some contexts it should be useful to consider as well one or more 
of the other three intervals recommended here when calculating $Z_{\rm Bi}$. 

Fig. 18. (a) Coverage of 68.27% C.L. intervals constructed by Cousins [60] for the 
ratio of Poisson means and translated here to intervals for p, as a function of p and 
$n_{\rm tot}$, and (b) unconditional coverage of the same intervals for $\lambda$. A horizontal slice 
of (a) is in Fig. 17b. 

Fig. 19. (a) Coverage of intervals calculated using a Bayesian method with uniform 
prior and containing 68.27% posterior probability, as a function of p and $n_{\rm tot}$, and 
(b) unconditional coverage of the same intervals for $\lambda$. 